Introduction

Dataset from the U.S. Small Business Administration (SBA). For this case-study assignment, students assume the role of loan officer at a bank and are asked to approve or deny a loan by assessing its risk of default using logistic regression.

Background and Description of Datasets

The U.S. SBA was founded in 1953 on the principle of promoting and assisting small enterprises in the U.S. credit market. Small businesses have been a primary source of job creation in the United States; therefore, fostering small business formation and growth has social benefits by creating job opportunities and reducing unemployment. One way SBA assists these small business enterprises is through a loan guarantee program which is designed to encourage banks to grant loans to small businesses. SBA acts much like an insurance provider to reduce the risk for a bank by taking on some of the risk through guaranteeing a portion of the loan. In the case that a loan goes into default, SBA then covers the amount they guaranteed.

There have been many success stories of start-ups receiving SBA loan guarantees such as FedEx and Apple Computer. However, there have also been stories of small businesses and/or start-ups that have defaulted on their SBA-guaranteed loans. The rate of default on these loans has been a source of controversy for decades. Conservative economists believe that credit markets perform efficiently without government participation. Supporters of SBA-guaranteed loans argue that the social benefits of job creation by those small businesses receiving government-guaranteed loans far outweigh the costs incurred from defaulted loans.

Since SBA loans only guarantee a portion of the entire loan balance, banks will incur some losses if a small business defaults on its SBA-guaranteed loan. Therefore, banks are still faced with a difficult choice as to whether they should grant such a loan because of the high risk of default. One way to inform their decision making is through analyzing relevant historical data such as the datasets provided here.

https://amstat.tandfonline.com/doi/full/10.1080/10691898.2018.1434342#.W4TAIi2B00o

!pip install geopandas==0.3.0 --user !pip install pyshp==1.2.10 --user !pip install shapely==1.6.3 --user
In [1]:
from IPython.core.display import display, HTML
display(HTML("""<style> .container {width:96% !important;}</style>"""))

from IPython.display import IFrame
In [2]:
import sys
sys.path.insert(0,'../')
from utils.paths import *
In [3]:
import pandas as pd
import numpy as np
from plotly.offline import init_notebook_mode, iplot
import cufflinks as cf
init_notebook_mode()
cf.go_offline()
from __future__ import division
In [4]:
!aws s3 ls s3://eh-home/ehda-calvin/SBA_study/
                           PRE pdf/
2018-10-29 08:48:12      16185 7a_504_FOIA Data Dictionary.xlsx
2018-10-29 08:48:13   30287320 FOIA - 504 (FY1991-Present).xlsx
2018-10-29 08:48:15   55120216 FOIA - 7(a)(FY1991-FY1999).xlsx
2018-10-29 08:48:17  116444478 FOIA - 7(a)(FY2000-FY2009).xlsx
2018-10-29 08:48:19   84757645 FOIA - 7(a)(FY2010-Present).xlsx
2018-10-26 03:13:21  179430516 SBAnational.csv
2018-11-02 04:14:18  262351663 SBAnational_new.csv
2018-11-02 03:14:49  101980700 company_business_type.csv
2018-11-01 07:11:23   10955521 company_default_record.csv
2018-11-01 07:03:29   71184964 extra_company_info.csv
2018-10-29 03:22:11   10025457 loan_record.csv
2018-10-29 02:42:40       1428 t0001-10.1080%2F10691898.2018.1434342.csv
2018-10-29 02:42:41        772 t0002-10.1080%2F10691898.2018.1434342.csv
2018-10-29 02:42:42        785 t0003-10.1080%2F10691898.2018.1434342.csv
2018-10-29 02:42:43        913 t0004-10.1080%2F10691898.2018.1434342.csv
2018-10-29 02:42:43        202 t0005-10.1080%2F10691898.2018.1434342.csv
2018-10-29 02:42:44        203 t0006-10.1080%2F10691898.2018.1434342.csv
2018-10-29 02:42:45        289 t0007-10.1080%2F10691898.2018.1434342.csv
2018-10-29 02:42:46        408 t0008-10.1080%2F10691898.2018.1434342.csv
2018-10-29 02:42:47        204 t0009-10.1080%2F10691898.2018.1434342.csv
2018-10-29 02:42:48        294 t0010-10.1080%2F10691898.2018.1434342.csv
2018-10-29 02:42:49        250 t0011-10.1080%2F10691898.2018.1434342.csv
2018-10-29 02:42:50        316 t0012-10.1080%2F10691898.2018.1434342.csv
2018-10-29 02:42:50       2346 t0013-10.1080%2F10691898.2018.1434342.csv
2018-11-02 04:07:30     397449 ztf1.csv
In [121]:
# User define function

def table(no):
    # there are 13 additional table
    return pd.read_csv(path_SBA + 't00%02d'%no + '-10.1080%2F10691898.2018.1434342.csv')

def default_cat_table(data, cat):
    default_cat = data.groupby([cat, 'default']).count().max(1).unstack()
    default_cat['ALL'] = data[cat].value_counts()
    default_cat['default_rate'] = (default_cat[1] / default_cat['ALL'])
    default_cat = default_cat.rename(columns = {1: 'Default', 0: 'Non-default'})
    return default_cat
In [9]:
nat = pd.read_csv(path_SBA + 'SBAnational_new.csv', sep = ';', low_memory=False,
                 dtype = {'Zip'   : str,
                          'Zip3d' : str,
                          'Zip5d' : str,
                          'fips'  : str})
In [105]:
nat.head().T
Out[105]:
0 1 2 3 4
LoanNr_ChkDgt 1000014003 1000024006 1000034009 1000044001 1000054004
Name ABC HOBBYCRAFT LANDMARK BAR & GRILLE (THE) WHITLOCK DDS, TODD M. BIG BUCKS PAWN & JEWELRY, LLC ANASTASIA CONFECTIONS, INC.
City EVANSVILLE NEW PARIS BLOOMINGTON BROKEN ARROW ORLANDO
State IN IN IN OK FL
Zip 47711 46526 47401 74012 32801
Bank FIFTH THIRD BANK 1ST SOURCE BANK GRANT COUNTY STATE BANK 1ST NATL BK & TR CO OF BROKEN FLORIDA BUS. DEVEL CORP
BankState OH IN IN OK FL
NAICS 451120 722410 621210 0 0
ApprovalDate 1997-02-28 1997-02-28 1997-02-28 1997-02-28 1997-02-28
ApprovalFY 1997 1997 1997 1997 1997
Term 84 60 180 60 240
NoEmp 4 2 7 2 14
NewExist 2 2 1 1 1
CreateJob 0 0 0 0 7
RetainedJob 0 0 0 0 7
FranchiseCode 0 0 0 0 0
UrbanRural 0 0 0 0 0
RevLineCr 0 0 0 0 0
LowDoc 1 1 0 1 0
ChgOffDate NaN NaN NaN NaN NaN
DisbursementDate 1999-02-28 1997-05-31 1997-12-31 1997-06-30 1997-05-14
DisbursementGross 60000 40000 287000 35000 229000
BalanceGross 0 0 0 0 0
MIS_Status P I F P I F P I F P I F P I F
ChgOffPrinGr 0 0 0 0 0
GrAppv 60000 40000 287000 35000 229000
SBA_Appv 48000 32000 215250 28000 229000
default 0 0 0 0 0
Zip5d 47711 46526 47401 74012 32801
Zip3d 477 465 474 740 328
SBA_ratio 0.8 0.8 0.75 0.8 1
Zip_length 5 5 5 5 5
RealEstate 0 0 0 0 1
NAICS_default_rate 23 22 10 NaN NaN
NAICS_group 44 72 62 0 0
Name2 ABC HOBBYCRAFT|IN(47711) LANDMARK BAR & GRILLE (THE)|IN(46526) WHITLOCK DDS, TODD M.|IN(47401) BIG BUCKS PAWN & JEWELRY, LLC|OK(74012) ANASTASIA CONFECTIONS, INC.|FL(32801)
loan_start 1997 1997 1997 1997 1997
loan_record_dict {1997: 1} {1997: 1} {1997: 1} {1997: 1} {1997: 1}
suffix NO SUFFIX NO SUFFIX NO SUFFIX LLC INC
Loan_age 0 0 0 0 0
Previous_loan 0 0 0 0 0
default_record_dict NaN NaN NaN NaN NaN
default_times 0 0 0 0 0
zip 47711 46526 47401 74012 32801
fips 18163 18039 18013 40143 12095
BusinessType INDIVIDUAL CORPORATION INDIVIDUAL CORPORATION NaN

Cases in year

In [125]:
default_cat_table(nat, 'ApprovalFY').ALL.iplot(kind = 'bar', title = 'SBA cases',
                                               yTitle = 'no. of cases', xTitle = 'Year')
In [127]:
default_cat_table(nat, 'ApprovalFY').loc[1990:2014].default_rate.iplot(kind = 'bar', title = 'SBA default rate',
                                               yTitle = 'Default rate', xTitle = 'Year')

NAICS group

In [135]:
nace_group = table(3)
nace_group['sector_group'] = nace_group.Sector.str[:2]
nace_group.head()
Out[135]:
Sector Description sector_group
0 11 Agriculture, forestry, fishing and hunting 11
1 21 Mining, quarrying, and oil and gas extraction 21
2 22 Utilities 22
3 23 Construction 23
4 31–33 Manufacturing 31
In [144]:
default_cat_table2 = default_cat_table(nat, 'NAICS_group').reset_index()

default_cat_table2['NAICS_group'] = default_cat_table2['NAICS_group'].astype(str)
default_cat_table2 = default_cat_table2.merge(nace_group, how = 'left', left_on = 'NAICS_group', right_on = 'sector_group')
use_cols = ['Description', 'NAICS_group', 'default_rate', 'Default', 'ALL']
default_cat_table2[use_cols]
Out[144]:
Description NAICS_group default_rate Default ALL
0 NaN 0 0.083230 16783 201646
1 Agriculture, forestry, fishing and hunting 11 0.090272 812 8995
2 Mining, quarrying, and oil and gas extraction 21 0.084819 157 1851
3 Utilities 22 0.141994 94 662
4 Construction 23 0.232558 15463 66491
5 Manufacturing 31 0.153722 10438 67902
6 Wholesale trade 42 0.194777 9480 48671
7 Retail trade 44 0.227346 28867 126974
8 Transportation and warehousing 48 0.265051 5939 22407
9 Information 51 0.248284 2821 11362
10 Finance and insurance 52 0.284266 2692 9470
11 Real estate and rental and leasing 53 0.287312 3904 13588
12 Professional, scientific, and technical services 54 0.190766 12957 67921
13 Management of companies and enterprises 55 0.101562 26 256
14 Administrative and support and waste managemen... 56 0.235513 7661 32529
15 Educational services 61 0.242462 1552 6401
16 Health care and social assistance 62 0.103793 5736 55264
17 Arts, entertainment, and recreation 71 0.206144 3013 14616
18 Accommodation and food services 72 0.220438 14882 67511
19 Other services (except public administration) 81 0.196552 14229 72393
20 Public administration 92 0.154185 35 227
In [153]:
default_cat_table2.set_index('Description').default_rate.sort_values().iplot(kind = 'bar', title = 'Default on sectors', 
                                                                            yTitle = 'Default rate')

Business type

In [154]:
default_cat_table(nat, 'BusinessType')
Out[154]:
default Non-default Default ALL default_rate
BusinessType
CORPORATION 386701 92635 479336 0.193257
INDIVIDUAL 157728 38687 196415 0.196966
PARTNERSHIP 26998 3123 30121 0.103682

Loan term

In [161]:
nat['Term_year'] = nat.Term.apply(lambda x: round(x/12))
In [166]:
default_cat_table(nat, 'Term_year').head()
Out[166]:
default Non-default Default ALL default_rate
Term_year
0.0 2489.0 7864.0 10353 0.759587
1.0 22221.0 14537.0 36758 0.395479
2.0 9413.0 16973.0 26386 0.643258
3.0 21374.0 22609.0 43983 0.514040
4.0 17435.0 24939.0 42374 0.588545

It appears that SBA may put term to 0 if the cases are default. Therefore it is not recommended to use this variable. However, this variable were used to create the other variable 'RealEstate' because banks only approve long loan if the company is an owner of properties.

Suffix

The last word of the company name may indicate the legal type / type of business

In [177]:
default_cat_table(nat, 'suffix').sort_values('default_rate', ascending = False).head(10)
Out[177]:
default Non-default Default ALL default_rate
suffix
LLC 59218 18110 77328 0.234197
LL 2102 611 2713 0.225212
CORP 16047 4271 20318 0.210208
ENTERPRISES 2353 621 2974 0.208810
CONSTRUCTION 2395 618 3013 0.205111
SALON 1558 402 1960 0.205102
IN 6161 1376 7537 0.182566
SERVICE 11988 2625 14613 0.179635
NO SUFFIX 349956 75430 425386 0.177321
RESTAURANT 7604 1583 9187 0.172309

FranchiseCode

In [178]:
default_cat_table(nat, 'FranchiseCode')
Out[178]:
default Non-default Default ALL default_rate
FranchiseCode
0 695724 149681 845405 0.177052
1 43872 7860 51732 0.151937

UrbanRural

In [179]:
default_cat_table(nat, 'UrbanRural')
Out[179]:
default Non-default Default ALL default_rate
UrbanRural
0 299840 22962 322802 0.071133
1 354409 114866 469275 0.244773
2 85347 19713 105060 0.187636

RevLineCr

Revolving line of credit

Revolving credit is a line of credit where the customer pays a commitment fee to a financial institution to borrow money, and is then allowed to use the funds when needed. It usually is used for operating purposes and the amount drawn can fluctuate each month depending on the customer's current cash flow needs. Revolving lines of credit can be taken out by corporations or individuals.

https://www.investopedia.com/terms/r/revolvingcredit.asp#ixzz5Vgppy7tP

In [180]:
default_cat_table(nat, 'RevLineCr')
Out[180]:
default Non-default Default ALL default_rate
RevLineCr
0 589927 106551 696478 0.152985
1 149669 50990 200659 0.254113

LowDoc

In [181]:
default_cat_table(nat, 'LowDoc')
Out[181]:
default Non-default Default ALL default_rate
LowDoc
0.0 634910 146060 780970 0.187024
1.0 100264 9904 110168 0.089899

Program for small loan that require much less documents

Past records

Past record was based on if that company has previous appearence in that dataset (e.g. previous loan, when did the first loan made and default history.)

In [172]:
default_cat_table(nat, 'Loan_age')
Out[172]:
default Non-default Default ALL default_rate
Loan_age
0 709753.0 153200.0 862953 0.177530
1 10115.0 1996.0 12111 0.164809
2 6560.0 915.0 7475 0.122408
3 4053.0 503.0 4556 0.110404
4 2685.0 306.0 2991 0.102307
5 1885.0 181.0 2066 0.087609
6 1316.0 130.0 1446 0.089903
7 895.0 93.0 988 0.094130
8 680.0 58.0 738 0.078591
9 438.0 48.0 486 0.098765
10 358.0 28.0 386 0.072539
11 248.0 23.0 271 0.084871
12 184.0 19.0 203 0.093596
13 142.0 16.0 158 0.101266
14 99.0 10.0 109 0.091743
15 59.0 4.0 63 0.063492
16 42.0 4.0 46 0.086957
17 30.0 1.0 31 0.032258
18 14.0 3.0 17 0.176471
19 13.0 1.0 14 0.071429
20 9.0 1.0 10 0.100000
21 5.0 1.0 6 0.166667
22 3.0 NaN 3 NaN
23 4.0 NaN 4 NaN
25 1.0 NaN 1 NaN
26 2.0 NaN 2 NaN
27 3.0 NaN 3 NaN
In [171]:
default_cat_table(nat, 'default_times')
Out[171]:
default Non-default Default ALL default_rate
default_times
0 739028.0 155135.0 894163 0.173497
1 532.0 2215.0 2747 0.806334
2 33.0 169.0 202 0.836634
3 3.0 19.0 22 0.863636
4 NaN 2.0 2 1.000000
5 NaN 1.0 1 1.000000

Locations and defaults

In [185]:
default_fips = default_cat_table(nat, 'fips')
default_fips[default_fips.ALL > 10].sort_values(['ALL', 'default_rate'], ascending = False).head()
Out[185]:
default Non-default Default ALL default_rate
fips
06037 30731.0 9211.0 39942 0.230609
17031 9811.0 3744.0 13555 0.276208
06073 10851.0 2133.0 12984 0.164279
48201 9882.0 2636.0 12518 0.210577
04013 9742.0 2673.0 12415 0.215304
In [16]:
fip_ct = nat.fips.value_counts()
fip_ct[fip_ct >= 10].shape
Out[16]:
(2542,)
In [68]:
default_fips = nat.groupby(['fips', 'default']).count().max(1).unstack()
default_fips['ALL'] = nat.fips.value_counts()
default_fips = default_fips.reset_index()
default_fips = default_fips.rename(columns = {'index': 'fips', 
                                              0: 'Non-default',
                                              1: 'Default',
                                              })
default_fips['Default'] = default_fips['Default'].fillna(0)
default_fips['default_rate'] = (default_fips['Default'] / default_fips['ALL']).round(3)
default_fips_1 = default_fips[default_fips.ALL >= 10]
In [69]:
default_fips.head()
Out[69]:
default fips Non-default Default ALL default_rate
0 01001 48.0 14.0 62 0.226
1 01003 256.0 67.0 323 0.207
2 01005 51.0 5.0 56 0.089
3 01007 41.0 8.0 49 0.163
4 01009 229.0 38.0 267 0.142
In [77]:
import plotly.figure_factory as ff

colorscale = ["#f7fbff","#ebf3fb","#deebf7","#d2e3f3","#c6dbef","#b3d2e9","#9ecae1",
              "#85bcdb","#6baed6","#57a0ce","#4292c6","#3082be","#2171b5","#1361a9",
              "#08519c","#0b4083","#08306b"]
In [85]:
binpoint = [500, 1000, 3000, 5000, 10000, 20000]
cs = colorscale[0::2]
fig = ff.create_choropleth(fips = default_fips.fips, values = default_fips.ALL, binning_endpoints = binpoint,
                          colorscale = cs, title='SBA cases', legend_title='no. of cases', round_legend_values=True
                          )
iplot(fig)
In [88]:
binpoint = [10, 100, 1000, 3000, 5000]
cs = colorscale[0::3]
fig = ff.create_choropleth(fips = default_fips.fips, values = default_fips.ALL, binning_endpoints = binpoint,
                          colorscale = cs, title='SBA default cases', legend_title='no. of default cases', round_legend_values=True
                          )
iplot(fig)
In [84]:
binpoint = [0.05, 0.1, 0.2, 0.35, 0.5]
cs = colorscale[0::3]
fig = ff.create_choropleth(fips = default_fips_1.fips, values = default_fips_1.default_rate, binning_endpoints = binpoint,
                          colorscale = cs, title='SBA Default rate', legend_title='Default rate'
                          )
iplot(fig)